19  Predictive Analytics - Practical

19.1 Introduction

The goal of this session is to cover some basic techniques we can use to explore the relationships and differences between variables in our data. We’ll also review the different types of variables encountered in sport data.

The pre-class reading for this week was fairly high-level, and introduced a series of techniques that we’ll cover in detail in Semester Two.

For now, we’ll focus on some simpler approaches, which can still identify interesting and important features of our data. This will help us revise some concepts you should have covered at undergraduate level, and learn how to implement them in R.

There is a fundamental assumption here that the relationships we identify in existing historical data will continue in the future. Therefore, while the approaches discussed below are primarily about identifying relationships or differences in the past, we may wish to use them to think about what might happen in the future.

19.2 A recap of variable types

Before exploring relationships, it’s worth recalling the main types of variable we encounter in sport data:

  • Continuous variables can take any numeric value within a range (e.g., average race finish position).
  • Discrete variables are whole-number counts (e.g., race wins, podium finishes).
  • Ordinal variables are categories with a natural order (e.g., a fitness rating from 1 to 5).
  • Nominal variables are categories with no inherent order (e.g., team).

In R, continuous and discrete variables are typically stored as numeric or integer, while ordinal and nominal variables are best stored as ordered and unordered factors respectively.
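
As a quick illustration of how these types map to R classes (the values here are hypothetical):

# How the main variable types map to R classes (hypothetical values)
age     <- c(24, 31, 28)                                # continuous: numeric
wins    <- c(3L, 0L, 7L)                                # discrete: integer
team    <- factor(c("Mercedes", "Ferrari", "Mercedes")) # nominal: unordered factor
fitness <- factor(c(2, 5, 3), levels = 1:5,
                  ordered = TRUE)                       # ordinal: ordered factor

sapply(list(age, wins, team, fitness), class)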

19.3 Demonstration - Exploring Relationships

I’m going to create a dataset called [f1_data] related to Formula One.
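
The generation code isn’t shown here, but a dataset with this structure could be simulated along the lines below. This is only a sketch - the seed and distributions are assumptions, so it won’t reproduce the exact values or correlations that follow.

# A sketch of simulating a dataset with the same structure (illustrative only)
set.seed(123)

n <- 100
f1_data <- data.frame(
  driver_age                = sample(20:40, n, replace = TRUE),
  years_experience          = sample(1:20, n, replace = TRUE),
  team                      = sample(c("Mercedes", "Red Bull", "Ferrari",
                                       "McLaren", "Aston Martin"), n, replace = TRUE),
  qualifying_position       = runif(n, min = 1, max = 20),
  constructor_championships = sample(0:5, n, replace = TRUE),
  podiums                   = round(runif(n, 10, 68)),
  race_wins                 = round(runif(n, 1, 26)),
  avg_race_finish           = runif(n, min = 1, max = 16)
)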

I’ll take a quick look at the first few rows of my data before doing anything else:

head(f1_data)
  driver_age years_experience         team qualifying_position
1         34                6     Mercedes           11.850951
2         38               14 Aston Martin           13.886814
3         33                7 Aston Martin            4.111364
4         22               10     Red Bull           12.115860
5         29                5     Red Bull           19.817601
6         37                6 Aston Martin            3.402618
  constructor_championships podiums race_wins avg_race_finish
1                         2      24         1       14.872417
2                         1      52        19        9.040418
3                         2      33        17        7.116612
4                         2      29        10        9.056831
5                         0      16         7       11.349727
6                         5      25         8       13.936906

19.3.1 Descriptive Statistics and Correlations

A good first step is to calculate some summary statistics and the correlations between the continuous variables. This helps us understand the general trends and relationships in the data.

# Descriptive statistics
summary(f1_data)
   driver_age    years_experience     team           qualifying_position
 Min.   :20.00   Min.   : 1.00    Length:100         Min.   : 1.438     
 1st Qu.:26.00   1st Qu.: 6.75    Class :character   1st Qu.: 6.859     
 Median :31.00   Median :10.00    Mode  :character   Median :10.257     
 Mean   :30.66   Mean   :10.49                       Mean   : 9.933     
 3rd Qu.:35.00   3rd Qu.:14.00                       3rd Qu.:13.225     
 Max.   :40.00   Max.   :20.00                       Max.   :19.838     
 constructor_championships    podiums        race_wins     avg_race_finish 
 Min.   :0.00              Min.   :10.00   Min.   : 1.00   Min.   : 1.000  
 1st Qu.:1.00              1st Qu.:27.75   1st Qu.:10.00   1st Qu.: 6.881  
 Median :2.00              Median :35.50   Median :13.00   Median : 8.562  
 Mean   :2.35              Mean   :35.62   Mean   :13.17   Mean   : 8.591  
 3rd Qu.:4.00              3rd Qu.:45.00   3rd Qu.:16.25   3rd Qu.:10.089  
 Max.   :5.00              Max.   :68.00   Max.   :26.00   Max.   :15.688  
# Correlation matrix for continuous variables
cor(f1_data[, c("driver_age", "podiums", "years_experience", "race_wins", "avg_race_finish", "qualifying_position")])
                     driver_age     podiums years_experience   race_wins
driver_age           1.00000000  0.04018587       0.05843373  0.03751335
podiums              0.04018587  1.00000000       0.89587460  0.72258091
years_experience     0.05843373  0.89587460       1.00000000  0.61363657
race_wins            0.03751335  0.72258091       0.61363657  1.00000000
avg_race_finish     -0.04245864 -0.33639896      -0.28298182 -0.66515837
qualifying_position -0.06812982  0.02712041       0.09338054 -0.14432972
                    avg_race_finish qualifying_position
driver_age              -0.04245864         -0.06812982
podiums                 -0.33639896          0.02712041
years_experience        -0.28298182          0.09338054
race_wins               -0.66515837         -0.14432972
avg_race_finish          1.00000000          0.38097763
qualifying_position      0.38097763          1.00000000

The summary() function provides basic summary statistics (minimum, quartiles, median, mean, and maximum) for the numeric variables.

The cor() function computes the correlation matrix, showing relationships between continuous variables like age, podium finishes, race wins, and more.

The correlation coefficient tells us both the magnitude (strength) of the relationship and its direction (positive or negative).

However, this output gives us no way of knowing whether these relationships are statistically significant. For that, we need to calculate a p-value for each correlation.
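
For a single pair of variables, base R’s cor.test() returns the correlation together with a p-value and confidence interval:

# Test a single correlation: experience vs podium finishes
cor.test(f1_data$years_experience, f1_data$podiums)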

In the next example, I’ve used the correlation package to compute the correlation matrix with p-values, producing an SPSS/Stata-style output.

library(correlation)
library(dplyr)
sapply(f1_data, class)
               driver_age          years_experience                      team 
                "integer"                 "integer"               "character" 
      qualifying_position constructor_championships                   podiums 
                "numeric"                 "integer"                 "numeric" 
                race_wins           avg_race_finish 
                "numeric"                 "numeric" 
# Select only numeric columns using dplyr
# (select_if() is superseded; select(where(is.numeric)) is the modern equivalent)
f1_numeric <- f1_data %>%
  select_if(is.numeric)

# View the filtered data with only numeric columns
str(f1_numeric)
'data.frame':   100 obs. of  7 variables:
 $ driver_age               : int  34 38 33 22 29 37 30 24 39 33 ...
 $ years_experience         : int  6 14 7 10 5 6 16 11 4 12 ...
 $ qualifying_position      : num  11.85 13.89 4.11 12.12 19.82 ...
 $ constructor_championships: int  2 1 2 2 0 5 1 1 1 3 ...
 $ podiums                  : num  24 52 33 29 16 25 50 42 22 38 ...
 $ race_wins                : num  1 19 17 10 7 8 17 10 14 21 ...
 $ avg_race_finish          : num  14.87 9.04 7.12 9.06 11.35 ...
result <- correlation(f1_numeric)

# View the result
print(result)
# Correlation Matrix (pearson-method)

Parameter1                |                Parameter2 |     r |         95% CI | t(98) |         p
--------------------------------------------------------------------------------------------------
driver_age                |          years_experience |  0.06 | [-0.14,  0.25] |  0.58 | > .999   
driver_age                |       qualifying_position | -0.07 | [-0.26,  0.13] | -0.68 | > .999   
driver_age                | constructor_championships | -0.05 | [-0.24,  0.15] | -0.47 | > .999   
driver_age                |                   podiums |  0.04 | [-0.16,  0.23] |  0.40 | > .999   
driver_age                |                 race_wins |  0.04 | [-0.16,  0.23] |  0.37 | > .999   
driver_age                |           avg_race_finish | -0.04 | [-0.24,  0.16] | -0.42 | > .999   
years_experience          |       qualifying_position |  0.09 | [-0.10,  0.28] |  0.93 | > .999   
years_experience          | constructor_championships | -0.04 | [-0.23,  0.16] | -0.39 | > .999   
years_experience          |                   podiums |  0.90 | [ 0.85,  0.93] | 19.96 | < .001***
years_experience          |                 race_wins |  0.61 | [ 0.47,  0.72] |  7.69 | < .001***
years_experience          |           avg_race_finish | -0.28 | [-0.45, -0.09] | -2.92 | 0.065    
qualifying_position       | constructor_championships | -0.03 | [-0.23,  0.16] | -0.34 | > .999   
qualifying_position       |                   podiums |  0.03 | [-0.17,  0.22] |  0.27 | > .999   
qualifying_position       |                 race_wins | -0.14 | [-0.33,  0.05] | -1.44 | > .999   
qualifying_position       |           avg_race_finish |  0.38 | [ 0.20,  0.54] |  4.08 | 0.002**  
constructor_championships |                   podiums | -0.12 | [-0.31,  0.08] | -1.16 | > .999   
constructor_championships |                 race_wins | -0.07 | [-0.27,  0.13] | -0.73 | > .999   
constructor_championships |           avg_race_finish |  0.04 | [-0.15,  0.24] |  0.44 | > .999   
podiums                   |                 race_wins |  0.72 | [ 0.61,  0.80] | 10.35 | < .001***
podiums                   |           avg_race_finish | -0.34 | [-0.50, -0.15] | -3.54 | 0.010**  
race_wins                 |           avg_race_finish | -0.67 | [-0.76, -0.54] | -8.82 | < .001***

p-value adjustment method: Holm (1979)
Observations: 100

We can also do this in base R, although it takes more code. The example below captures a number of techniques we’ve previously covered, and is useful for revision.

# Function to calculate a correlation matrix with p-values
cor_with_p_values <- function(df) {
  # Get the number of variables
  n <- ncol(df)
  
  # Create empty matrices for correlations and p-values
  corr_matrix <- matrix(NA, n, n)
  p_matrix <- matrix(NA, n, n)
  
  # Loop over all variable pairs (the matrices are symmetric, so each
  # pair is tested twice; that's fine for illustration)
  for (i in 1:n) {
    for (j in 1:n) {
      test <- cor.test(df[, i], df[, j])
      corr_matrix[i, j] <- test$estimate  # Correlation coefficient
      p_matrix[i, j] <- test$p.value      # P-value
    }
  }
  
  # Return both matrices; rows/columns follow the column order of df
  list(correlation = corr_matrix, p_values = p_matrix)
}

# Select relevant continuous variables from dataset
cont_vars <- f1_data[, c("driver_age", "podiums", "years_experience", "race_wins", "avg_race_finish", "qualifying_position")]

# Apply function to get correlation and p-value matrices
results <- cor_with_p_values(cont_vars)

# Round results for better readability
correlation_matrix <- round(results$correlation, 2)
p_value_matrix <- round(results$p_values, 4)  # Rounding p-values to 4 decimal places

# Display results
correlation_matrix
      [,1]  [,2]  [,3]  [,4]  [,5]  [,6]
[1,]  1.00  0.04  0.06  0.04 -0.04 -0.07
[2,]  0.04  1.00  0.90  0.72 -0.34  0.03
[3,]  0.06  0.90  1.00  0.61 -0.28  0.09
[4,]  0.04  0.72  0.61  1.00 -0.67 -0.14
[5,] -0.04 -0.34 -0.28 -0.67  1.00  0.38
[6,] -0.07  0.03  0.09 -0.14  0.38  1.00
p_value_matrix
       [,1]   [,2]   [,3]  [,4]   [,5]   [,6]
[1,] 0.0000 0.6914 0.5636 0.711 0.6749 0.5006
[2,] 0.6914 0.0000 0.0000 0.000 0.0006 0.7888
[3,] 0.5636 0.0000 0.0000 0.000 0.0043 0.3554
[4,] 0.7110 0.0000 0.0000 0.000 0.0000 0.1520
[5,] 0.6749 0.0006 0.0043 0.000 0.0000 0.0001
[6,] 0.5006 0.7888 0.3554 0.152 0.0001 0.0000

19.3.1.1 Visualising Correlations: Correlation Heatmap

A correlation heatmap is a visual tool that makes it easier to interpret the relationships between variables. We can use ggplot2 and the reshape2 package in R to create one.

# Load necessary libraries
library(ggplot2)
library(reshape2)

# The function above returns matrices without dimnames, so add the variable
# names first - otherwise the heatmap axes would show 1-6 rather than labels
rownames(correlation_matrix) <- colnames(correlation_matrix) <- colnames(cont_vars)

# Reshape correlation matrix for visualisation
melted_corr <- melt(correlation_matrix)

# Create the heatmap with variable labels
ggplot(data = melted_corr, aes(x = Var1, y = Var2, fill = value)) + 
  geom_tile(color = "white") +  # Add white gridlines for better visibility
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0, limit = c(-1, 1), 
                       name = "Correlation") + 
  theme_minimal() + 
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1),   # Rotate x-axis labels for better fit
        axis.text.y = element_text(size = 10),  # Adjust font size for y-axis labels
        axis.title.x = element_blank(),         # Remove axis titles for clean look
        axis.title.y = element_blank()) + 
  ggtitle("Correlation Heatmap with Variable Labels") +  # Add a title
  labs(x = "Variables", y = "Variables")

This code creates a heatmap that visually represents the correlation matrix.

  • Strong positive correlations are shown in red
  • Strong negative correlations are shown in blue
  • Weak correlations near zero are shown in white

This gives a quick visual summary of the relationships between variables.

19.3.1.2 Visualising Pairwise Relationships: Pair Plot (Scatterplot Matrix)

A pair plot, or scatterplot matrix, is another way to visualise the relationships between multiple variables at once. It creates scatterplots for each pair of variables, along with histograms for individual variables on the diagonal.

In this example, I’ve used the GGally library to create the plots.

# Load library
library(GGally)
# Pair plot of selected continuous variables
ggpairs(f1_data[, c("driver_age", "podiums", "years_experience", "race_wins", "avg_race_finish", "qualifying_position")])

The ggpairs() function from the GGally package creates a grid of scatterplots, which makes it easy to visually explore pairwise relationships between continuous variables. This plot includes scatterplots for each pair and histograms for the distribution of individual variables along the diagonal.

19.3.1.3 Significance of Correlations: Visualising Significant Correlations with Stars

We can enhance the heatmap by marking correlations that are statistically significant, highlighting them with stars or different colours based on the p-values.

# Prepare data for heatmap with significance stars
# (melt() unrolls both matrices in the same column-major order, so the rows align)
melted_corr_with_p <- melt(correlation_matrix)
melted_p_value <- melt(p_value_matrix)

# Add significance levels to the plot
melted_corr_with_p$significance <- cut(melted_p_value$value, breaks = c(-Inf, 0.001, 0.01, 0.05, Inf),
                                       labels = c("***", "**", "*", ""))

# Plot heatmap with significance levels
ggplot(data = melted_corr_with_p, aes(x = Var1, y = Var2, fill = value)) + 
  geom_tile() + 
  geom_text(aes(label = significance), color = "black", size = 5) + 
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0, limit = c(-1, 1), 
                       name = "Correlation") + 
  theme_minimal() + 
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) + 
  ggtitle("Correlation Heatmap with Significance Levels") +
  labs(x = "Variables", y = "Variables")

This enhanced heatmap highlights statistically significant correlations with stars. The number of stars indicates the significance level (*** for p < 0.001, ** for p < 0.01, and * for p < 0.05). This visualisation helps quickly identify both the strength and statistical significance of correlations between variables. Note that these p-values come from the unadjusted cor.test() loop above, whereas the correlation package applied a Holm correction, so the two sets of significance markers may differ.

19.3.2 Scatterplots to Visualise Relationships

Scatterplots are useful for visualising relationships between two continuous variables. Let’s explore the relationship between years of experience and podium finishes.

# Scatterplot: Years of Experience vs Podiums

plot(f1_data$years_experience, f1_data$podiums,
     xlab = "Years of Experience",
     ylab = "Number of Podium Finishes",
     main = "Years of Experience vs Podium Finishes")
abline(lm(podiums ~ years_experience, data = f1_data), col = "red") # Add trendline 

This scatterplot helps visualise the relationship between experience and podium finishes. The red line is a linear trendline that indicates any general trend (positive, negative, or no relationship).
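
For comparison with the ggplot2 figures used later in this section, an equivalent plot might look like this:

# ggplot2 version of the same scatterplot with a linear trendline
library(ggplot2)
ggplot(f1_data, aes(x = years_experience, y = podiums)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(title = "Years of Experience vs Podium Finishes",
       x = "Years of Experience",
       y = "Number of Podium Finishes")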

19.3.3 Boxplots to Compare Groups

Boxplots allow us to compare distributions across different categories.

Here, we’ll compare the average race finish by team.

# Boxplot: Average Race Finish by Team

boxplot(avg_race_finish ~ team,
        data = f1_data,
        main = "Average Race Finish by Team",
        xlab = "Team",
        ylab = "Average Race Finish Position",
        col = c("lightblue", "lightgreen", "lightpink", "lightyellow", "lightgray"))

This boxplot shows how the average race finish position differs by team. We can see which teams tend to have higher or lower finishing positions on average.

19.3.4 Bar Charts for Categorical Variables

A bar chart is a simple but effective way to summarise categorical data.

Here, we’ll plot the number of drivers by constructor championships.

# Bar chart: Constructor Championships
barplot(table(f1_data$constructor_championships),
        main = "Number of Drivers by Constructor Championships",
        xlab = "Constructor Championships",
        ylab = "Number of Drivers",
        col = "steelblue")

This bar chart shows the distribution of drivers across different constructor championship counts. It’s a straightforward way to understand the frequency of a categorical variable.

19.3.5 Density Plot for Continuous Variables

Density plots give us a smooth estimate of the distribution of a continuous variable. Let’s visualise the distribution of qualifying positions.

# Density plot: Qualifying Position
plot(density(f1_data$qualifying_position),
     main = "Density Plot of Qualifying Positions",
     xlab = "Qualifying Position",
     col = "darkorange")

This density plot shows the distribution of qualifying positions. It helps identify where most qualifying positions fall and whether the data is skewed.

19.3.6 Chi-Square Test for Categorical Variables

If we want to check the association between two categorical variables, like team and constructor championships, we can use a chi-square test.

# Chi-square test: Team vs Constructor Championships
chisq_test <- chisq.test(table(f1_data$team, f1_data$constructor_championships))
Warning in chisq.test(table(f1_data$team, f1_data$constructor_championships)):
Chi-squared approximation may be incorrect
chisq_test

    Pearson's Chi-squared test

data:  table(f1_data$team, f1_data$constructor_championships)
X-squared = 14.583, df = 20, p-value = 0.7997

The chi-square test assesses whether there is a significant association between team and the number of constructor championships. A low p-value (conventionally below 0.05) suggests an association. The warning above indicates that some expected cell counts are small, so the chi-squared approximation may be unreliable here.
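
One way to address that warning is to compute the p-value by Monte Carlo simulation instead of relying on the chi-squared approximation:

# Simulated p-value avoids the small-expected-count approximation problem
chisq.test(table(f1_data$team, f1_data$constructor_championships),
           simulate.p.value = TRUE, B = 2000)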

19.3.7 Multiple Linear Regression

Finally, we can use multiple linear regression to understand how multiple variables affect one outcome. For example, we can model how years of experience, team, race wins, and average race finish affect the number of podium finishes.

# Multiple linear regression: Predicting Podium Finishes
model <- lm(podiums ~ years_experience + team + race_wins + avg_race_finish, data = f1_data)
summary(model)

Call:
lm(formula = podiums ~ years_experience + team + race_wins + 
    avg_race_finish, data = f1_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-11.0261  -3.5254  -0.2822   3.2439  11.0097 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)        4.8784     3.4984   1.394    0.167    
years_experience   1.5741     0.1111  14.174  < 2e-16 ***
teamFerrari        0.6708     1.5169   0.442    0.659    
teamMcLaren       -1.2811     1.5605  -0.821    0.414    
teamMercedes       0.1499     1.4608   0.103    0.919    
teamRed Bull      -2.2194     1.3586  -1.634    0.106    
race_wins          0.8874     0.1709   5.191 1.24e-06 ***
avg_race_finish    0.3664     0.2240   1.636    0.105    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.699 on 92 degrees of freedom
Multiple R-squared:  0.8637,    Adjusted R-squared:  0.8533 
F-statistic: 83.29 on 7 and 92 DF,  p-value: < 2.2e-16

This linear model looks at how experience, team, race wins, and average race finish predict the number of podium finishes. The summary() output provides detailed information about the significance and coefficients of the model.
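
Because our interest is ultimately predictive, we can also use the fitted model to generate a prediction for a new, hypothetical driver (the values below are illustrative):

# Predict podium finishes for a hypothetical driver
new_driver <- data.frame(years_experience = 10,
                         team = "Ferrari",
                         race_wins = 12,
                         avg_race_finish = 9)
predict(model, newdata = new_driver)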

We’ll cover linear regression in greater detail later in the module.

19.4 Demonstration - Exploring Differences

Exploring differences between variables involves comparing groups or distributions to see whether the observed differences are statistically significant and meaningful.

Here, I’ll introduce a few techniques for exploring these differences using both statistical and visual approaches. I’ll include examples for continuous and categorical variables.

19.4.1 Statistical Approaches for Exploring Differences

19.4.1.1 T-Test (for Continuous Variables Between Two Groups)

A t-test is used to compare the means of two groups for a continuous variable. It checks whether the difference between the group means is statistically significant.

For example, we might want to test whether the average number of podium finishes differs significantly between two Formula One teams (e.g., “Mercedes” and “Red Bull”).

# Subset data for two teams
team_data <- f1_data[f1_data$team %in% c("Mercedes", "Red Bull"), ]

# Conduct t-test to compare the average number of podiums between two teams
t_test_result <- t.test(podiums ~ team, data = team_data)
t_test_result

    Welch Two Sample t-test

data:  podiums by team
t = 0.82027, df = 41.143, p-value = 0.4168
alternative hypothesis: true difference in means between group Mercedes and group Red Bull is not equal to 0
95 percent confidence interval:
 -4.000684  9.474368
sample estimates:
mean in group Mercedes mean in group Red Bull 
              36.73684               34.00000 

The t-test provides a p-value, which tells us if the observed difference between the two group means (in this case, podium finishes) is statistically significant. A p-value below 0.05 typically indicates significance.

19.4.1.2 ANOVA (for Continuous Variables Between More Than Two Groups)

ANOVA (Analysis of Variance) is used to compare the means of a continuous variable across more than two groups.

For example, we might want to compare the average number of podiums among drivers from multiple teams.

# Conduct one-way ANOVA to compare podiums across different teams
anova_result <- aov(podiums ~ team, data = f1_data)
summary(anova_result)
            Df Sum Sq Mean Sq F value Pr(>F)
team         4    536   134.0   0.886  0.475
Residuals   95  14367   151.2               

The ANOVA test gives a p-value to indicate whether there is a statistically significant difference in the means of podiums across the different teams.
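
Had the ANOVA been significant, we would want to know which specific teams differ. A sketch using base R’s TukeyHSD() post-hoc test:

# Pairwise comparisons between teams (only meaningful when the ANOVA is significant)
TukeyHSD(anova_result)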

19.4.1.3 Chi-Square Test (for Categorical Variables)

The chi-square test is used to determine if there is a significant association between two categorical variables.

For example, we might want to test whether [constructor championships] and [team] are related.

# Conduct chi-square test to assess association between team and constructor championships
chisq_test_result <- chisq.test(table(f1_data$team, f1_data$constructor_championships))
Warning in chisq.test(table(f1_data$team, f1_data$constructor_championships)):
Chi-squared approximation may be incorrect
chisq_test_result

    Pearson's Chi-squared test

data:  table(f1_data$team, f1_data$constructor_championships)
X-squared = 14.583, df = 20, p-value = 0.7997

The chi-square test provides a p-value that indicates whether there is a statistically significant association between the two categorical variables (in this case, [team] and [constructor championships]).

19.4.2 Visual Approaches for Exploring Differences

19.4.2.1 Boxplots (for Continuous Variables Across Groups)

Boxplots are useful for visualising the distribution of a continuous variable across different categories or groups. You can compare the medians, interquartile ranges (IQRs), and potential outliers.

For example, we might want to compare podium finishes across different teams.

# Boxplot: Podiums by Team
ggplot(f1_data, aes(x = team, y = podiums, fill = team)) + 
  geom_boxplot() + 
  theme_minimal() + 
  ggtitle("Boxplot of Podium Finishes by Team") + 
  xlab("Team") + 
  ylab("Number of Podium Finishes")

Boxplots provide a visual comparison of the spread and central tendency (median) of the data across different teams, allowing us to see the differences in podium finishes.

19.4.2.2 Violin Plots (for Continuous Variables Across Groups)

Violin plots combine boxplots and density plots, providing a visual comparison of the distribution of a continuous variable across different groups, along with the probability density.

For example, we might want to visualise the distribution of race wins across different teams.

# Violin Plot: Race Wins by Team
ggplot(f1_data, aes(x = team, y = race_wins, fill = team)) + 
  geom_violin(trim = FALSE) + 
  theme_minimal() + 
  ggtitle("Violin Plot of Race Wins by Team") + 
  xlab("Team") + 
  ylab("Number of Race Wins")

The violin plot shows both the spread (via the width of the plot) and the density of race wins across different teams, helping to visualise differences between groups.

19.4.2.3 Bar Charts (for Categorical Variables)

Bar charts are a simple and effective way to compare the frequency of categories or groups.

For example, we might want to compare the number of drivers by team.

# Bar Chart: Drivers by Team
ggplot(f1_data, aes(x = team, fill = team)) + 
  geom_bar() + 
  theme_minimal() + 
  ggtitle("Number of Drivers by Team") + 
  xlab("Team") + 
  ylab("Number of Drivers")

This bar chart provides a quick comparison of the number of drivers in each team, helping to explore differences in the frequency of categorical variables.

19.4.2.4 Histograms (for Comparing Distributions of Continuous Variables)

Histograms are used to show the distribution of a continuous variable. We can overlay histograms for different groups to compare their distributions.

For example, we might want to compare the distribution of average race finishes between two teams.

# Subset data for two teams
team_data <- f1_data[f1_data$team %in% c("Mercedes", "Red Bull"), ]

# Histogram: Comparing race finish distributions between two teams
ggplot(team_data, aes(x = avg_race_finish, fill = team)) + 
  geom_histogram(position = "identity", alpha = 0.5, bins = 20) + 
  theme_minimal() + 
  ggtitle("Distribution of Average Race Finishes by Team") + 
  xlab("Average Race Finish") + 
  ylab("Count")

This histogram overlays the distributions of average race finishes for “Mercedes” and “Red Bull” drivers, allowing us to visually compare their performance differences.

19.5 Practical - Exploring Relationships and Differences

rm(list=ls())
rugby_data <- read.csv('https://www.dropbox.com/scl/fi/0obv8hjpqn1zmei9xlwdo/rugby_data.csv?rlkey=nnogefb1bmu6qno7bxgsov7xk&dl=1')

19.5.1 Conduct a T-Test

Compare the average number of tries scored between players in the “Forward” and “Back” positions.

Use a t-test to determine if there is a statistically significant difference between the two groups.

Hint: Use the t.test() function.

Show solution
# T-Test: Compare tries scored between Forwards and Backs
t_test_result <- t.test(tries_scored ~ position, data = rugby_data)
t_test_result

19.5.2 Conduct a One-Way ANOVA

Compare the average number of tackles made across players in “Team A”, “Team B”, and “Team C”. Use a one-way ANOVA to determine if there is a statistically significant difference in tackles made between the teams.

Hint: Use the aov() function.

Show solution
# One-Way ANOVA: Compare tackles made across teams
anova_result <- aov(tackles_made ~ team, data = rugby_data)
summary(anova_result)

19.5.3 Conduct a Chi-Square Test

Test whether there is a significant association between the player position (Forward/Back) and the fitness level (ordinal variable, 1 to 5). Use a chi-square test.

Hint: Use the table() function to create a contingency table and the chisq.test() function for the test.

Show solution
# Chi-Square Test: Association between position and fitness level
chisq_test_result <- chisq.test(table(rugby_data$position, rugby_data$fitness_level))
chisq_test_result

19.5.4 Create a Boxplot

Visualise the distribution of tackles made across the different positions (“Forward” and “Back”) using a boxplot. This will help you compare the central tendency and spread of tackles made for each position.

Hint: Use ggplot2 and geom_boxplot().

Show solution
# Boxplot: Tackles made by position
library(ggplot2)
ggplot(rugby_data, aes(x = position, y = tackles_made, fill = position)) + 
  geom_boxplot() + 
  theme_minimal() + 
  ggtitle("Boxplot of Tackles Made by Position") + 
  xlab("Position") + 
  ylab("Number of Tackles Made")

19.5.5 Create a Violin Plot

Create a violin plot to visualise the distribution of tries scored across the different teams (“Team A”, “Team B”, “Team C”). The violin plot should show both the distribution shape and the spread of tries scored.

Hint: Use geom_violin().

Show solution
# Violin Plot: Tries scored by team
ggplot(rugby_data, aes(x = team, y = tries_scored, fill = team)) + 
  geom_violin(trim = FALSE) + 
  theme_minimal() + 
  ggtitle("Violin Plot of Tries Scored by Team") + 
  xlab("Team") + 
  ylab("Number of Tries Scored")

19.5.6 Create a Bar Chart

Create a bar chart to compare the number of players by fitness level across the dataset. This chart will help you see the distribution of players across the different fitness levels (1 to 5).

Hint: Use geom_bar().

Show solution
# Bar Chart: Players by Fitness Level
ggplot(rugby_data, aes(x = as.factor(fitness_level), fill = as.factor(fitness_level))) + 
  geom_bar() + 
  theme_minimal() + 
  ggtitle("Number of Players by Fitness Level") + 
  xlab("Fitness Level") + 
  ylab("Number of Players")

19.5.7 Conduct a Correlation Analysis

Explore the relationships between continuous variables, such as player age, years of experience, tries scored, tackles made, and injuries sustained. Compute the correlation matrix and visualise it using a heatmap.

Hint: Use the cor() function for the correlation matrix and ggplot2 to visualise it.

Show solution
# Correlation matrix for continuous variables
cor_matrix <- cor(rugby_data[, c("player_age", "years_experience", "tries_scored", "tackles_made", "injuries_sustained")])

# Reshape correlation matrix for heatmap (melt() comes from reshape2)
library(reshape2)
melted_corr <- melt(cor_matrix)

# Create heatmap
ggplot(melted_corr, aes(x = Var1, y = Var2, fill = value)) + 
  geom_tile(color = "white") + 
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0, limit = c(-1, 1), name = "Correlation") + 
  theme_minimal() + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) + 
  ggtitle("Correlation Heatmap of Continuous Variables")

19.5.8 Create a Histogram

Visualise the distribution of player ages in the dataset using a histogram. This will help you understand how player ages are distributed across the dataset.

Hint: Use geom_histogram().

Show solution
# Histogram: Distribution of player age
ggplot(rugby_data, aes(x = player_age)) + 
  geom_histogram(binwidth = 2, fill = "blue", color = "black") + 
  theme_minimal() + 
  ggtitle("Distribution of Player Age") + 
  xlab("Player Age") + 
  ylab("Count")

19.6 Extension Task

The following task is best completed in pairs. See how far you can go with it; it’s a good test of everything we’ve covered to this point in the module (including the pre-class reading for this week!)

19.6.1 Part 1: Data Cleaning and Preparation

19.6.1.1 Import the Dataset

Data is stored at: https://www.dropbox.com/s/x3anwwmvx6huua7/sport_data.csv?dl=1

  • Step 1: Load the [sport_data.csv] file into R using the appropriate import function.

  • Step 2: Inspect the first few rows and structure of the dataset (head(), str(), summary()).

  • Step 3: Check for any issues like incorrectly formatted columns (e.g., [is_foreign] should be a factor, not numeric).

Checklist:

  • Is the dataset successfully imported?
  • Do you understand the structure of the data?

19.6.1.2 Handle Missing Data

  • Step 1: Use is.na() to identify missing values across all columns.

  • Step 2: Decide whether to remove rows/columns with missing data or impute them. If imputing (assuming you know what this is), choose a method (mean, median, or predictive model).

  • Step 3: Implement your chosen approach in the R script and provide a justification in comments.

Checklist:

  • You’ve identified and addressed missing data?
  • You can provide a justification for your method of handling missing values?

19.6.1.3 Identify Outliers

  • Step 1: Use boxplots (ggplot2 or boxplot()) to visually identify outliers in the goals, assists, market_value, and salary columns.

  • Step 2: Choose a method to handle the outliers (e.g., removal, capping at a certain quantile, or leaving them).

  • Step 3: Implement the changes in R and justify your decision with comments.

Checklist:

  • Any outliers identified visually?
  • You’ve handled outliers appropriately and provided reasoning?

19.6.1.4 Modify Data

  • Step 1: Create a new variable [efficiency], defined as [goals] / [minutes_played].

  • Step 2: Remove columns that you deem unnecessary for further analysis (e.g., [contract_length], [team_performance]). Justify your choice.

  • Step 3: Rename or reorder columns if needed for clarity.

Checklist:

  • A new [efficiency] variable created?
  • All unnecessary columns removed, with reasoning provided?

19.6.2 Part 2: Descriptive Statistics and Visualisations

19.6.2.1 Descriptive Statistics

  • Step 1: Use summary() to generate basic descriptive statistics (minimum, maximum, quartiles, mean, and median) for goals, assists, minutes_played, market_value, and salary. Note that summary() does not report standard deviations; use sd() for those.

  • Step 2: Note any unusual or interesting insights from the descriptive statistics. For example, are the distributions skewed?

Checklist:

  • Descriptive statistics are calculated?
  • Any significant insights are noted?

19.6.2.2 Data Visualisation

  • Step 1: Create a boxplot comparing player market_value across different positions. (ggplot2::geom_boxplot())

  • Step 2: Create a scatter plot showing the relationship between goals and market_value.

  • Step 3: Create a bar chart showing the average [player_rating] for each position.

Checklist:

  • Boxplot created and differences between positions are visualised?
  • Scatter plot created to show the relationship between [goals] and [market_value]?
  • Bar chart has been created showing average player ratings by position?

19.6.2.3 Exploratory Data Analysis

  • Step 1: Use cor() to calculate correlations between numerical variables (goals, assists, minutes_played, market_value, etc.).

  • Step 2: Visualise the correlations using a heatmap (corrplot or ggplot2).

  • Step 3: Discuss any interesting or unexpected strong/weak correlations.

Checklist:

  • A correlation matrix calculated?
  • A heatmap of these relationships visualised?
  • Any significant correlations discussed?

19.6.3 Part 3: Predictive Analysis

19.6.3.1 Predictive Model: Player Market Value

  • Step 1: Split the dataset into training and test sets (e.g., 80/20 split using sample.split() from the caTools package).

  • Step 2: Build a linear regression model (lm()) to predict market_value based on goals, assists, minutes_played, age, player_rating, and position.

  • Step 3: Evaluate the model using performance metrics like R-squared and RMSE (summary() and Metrics::rmse()).

  • Step 4: Test the model on the holdout test set and compare its performance to the training set.

Checklist:

  • Model built and tested on both training and test sets?
  • You’ve evaluated the model’s performance using R-squared and RMSE?
  • You’ve commented on whether the model generalises well?

19.6.3.2 Classification Model: Foreign vs Domestic Players

  • Step 1: Build a logistic regression model (glm()) to classify whether a player is foreign (is_foreign) based on their performance statistics (goals, assists, minutes_played, age, market_value, player_rating).

  • Step 2: Evaluate the model’s accuracy using a confusion matrix (caret::confusionMatrix()).

  • Step 3: Calculate accuracy, precision, recall, and F1 score for model evaluation.

Checklist:

  • A logistic regression model has been built and evaluated?
  • A confusion matrix created?
  • Your model’s precision, recall, and F1 score calculated?

19.6.4 Part 4: Stretch Challenge

19.6.4.1 Advanced Data Manipulation

  • Step 1: Create a new column [high_performer], where a player is flagged as 1 if they meet all of the following conditions:
      - The player is foreign (is_foreign == 1),
      - Scored more than 10 goals,
      - Has a player_rating above 7.

  • Step 2: Summarise how many high-performing foreign players there are in the dataset.

Checklist:

  • A new column [high_performer] created correctly?
  • The count of high-performing foreign players generated?

19.6.4.2 Feature Engineering for Predictive Models

  • Step 1: Create new interaction terms between [player_rating] and [goals], and between [player_rating] and [market_value].

  • Step 2: Rebuild the linear regression model from Part 3, this time including these interaction terms as predictors.

  • Step 3: Compare the performance of the new model with interaction terms against the original model.

Checklist:

  • Interaction terms created and added to the model?
  • Model performance improved or degraded after adding interactions?
  • Discussed the impact of the new terms?

19.6.5 Possible Solution

#---------------------------------------
# Part 1: Data Cleaning and Preparation
#---------------------------------------

# 1. Importing data into a new dataframe called [data]
data <- read.csv("https://www.dropbox.com/s/x3anwwmvx6huua7/sport_data.csv?dl=1")
summary(data) # use summary to get a quick insight into how the data has been imported
   player_id         team_id          season         goals       
 Min.   :   1.0   Min.   : 1.00   Min.   :2016   Min.   : 0.000  
 1st Qu.: 250.8   1st Qu.: 5.00   1st Qu.:2017   1st Qu.: 5.000  
 Median : 500.5   Median :10.00   Median :2018   Median : 7.000  
 Mean   : 500.5   Mean   :10.14   Mean   :2018   Mean   : 6.902  
 3rd Qu.: 750.2   3rd Qu.:15.00   3rd Qu.:2019   3rd Qu.: 9.000  
 Max.   :1000.0   Max.   :20.00   Max.   :2020   Max.   :14.000  
    assists       minutes_played  yellow_cards    red_cards          age       
 Min.   : 0.000   Min.   : 503   Min.   :0.00   Min.   :0.000   Min.   :18.00  
 1st Qu.: 3.000   1st Qu.:1408   1st Qu.:1.00   1st Qu.:0.000   1st Qu.:23.00  
 Median : 5.000   Median :2252   Median :2.00   Median :0.000   Median :29.00  
 Mean   : 4.974   Mean   :2275   Mean   :2.14   Mean   :0.307   Mean   :29.26  
 3rd Qu.: 6.000   3rd Qu.:3176   3rd Qu.:3.00   3rd Qu.:1.000   3rd Qu.:35.00  
 Max.   :15.000   Max.   :3999   Max.   :8.00   Max.   :3.000   Max.   :40.00  
   height_cm       weight_kg         injuries         salary     
 Min.   :160.0   Min.   : 60.00   Min.   :0.000   Min.   : 0.52  
 1st Qu.:170.0   1st Qu.: 71.00   1st Qu.:1.000   1st Qu.: 5.64  
 Median :180.0   Median : 80.00   Median :1.000   Median :10.38  
 Mean   :180.2   Mean   : 80.12   Mean   :1.547   Mean   :10.39  
 3rd Qu.:191.0   3rd Qu.: 90.00   3rd Qu.:2.000   3rd Qu.:15.09  
 Max.   :200.0   Max.   :100.00   Max.   :5.000   Max.   :19.98  
  market_value     position         team_performance contract_length
 Min.   : 8.09   Length:1000        Min.   : 30.00   Min.   :1.000  
 1st Qu.:17.47   Class :character   1st Qu.: 48.00   1st Qu.:2.000  
 Median :21.32   Mode  :character   Median : 64.00   Median :3.000  
 Mean   :21.30                      Mean   : 64.73   Mean   :2.981  
 3rd Qu.:24.87                      3rd Qu.: 82.00   3rd Qu.:4.000  
 Max.   :38.07                      Max.   :100.00   Max.   :5.000  
 matches_played    is_foreign    player_rating   
 Min.   : 1.00   Min.   :0.000   Min.   : 4.000  
 1st Qu.:10.00   1st Qu.:0.000   1st Qu.: 5.600  
 Median :19.50   Median :0.000   Median : 7.000  
 Mean   :19.56   Mean   :0.293   Mean   : 7.035  
 3rd Qu.:29.00   3rd Qu.:1.000   3rd Qu.: 8.500  
 Max.   :38.00   Max.   :1.000   Max.   :10.000  
#---------------------------------------

# 2. Handling missing data
data_clean <- na.omit(data) # Example: Remove rows with missing data
# Alternatively: data$column[is.na(data$column)] <- median(data$column, na.rm = TRUE)


#---------------------------------------

# 3. Identifying Outliers
library(ggplot2)
boxplot(data$goals) # Check outliers visually

# Removing outliers - here I've used the 1st and 99th percentiles to drop the very
# high and very low values; in your own work, justify this choice in your comments.

q <- quantile(data$goals, probs = c(0.01, 0.99))
data <- subset(data, goals > q[1] & goals < q[2])

#---------------------------------------

# 4. Modifying Data
data$efficiency <- data$goals / data$minutes_played
data <- data[ , !(names(data) %in% c("contract_length", "team_performance"))] # Remove unnecessary variables


#---------------------------------------
# Part 2: Descriptive Statistics and Visualisations
#---------------------------------------

# 5. Descriptive Statistics
summary(data[c("goals", "assists", "minutes_played", "market_value", "salary")])
     goals           assists       minutes_played  market_value  
 Min.   : 2.000   Min.   : 0.000   Min.   : 503   Min.   : 8.09  
 1st Qu.: 5.000   1st Qu.: 3.000   1st Qu.:1400   1st Qu.:17.52  
 Median : 7.000   Median : 5.000   Median :2246   Median :21.27  
 Mean   : 6.867   Mean   : 4.967   Mean   :2269   Mean   :21.24  
 3rd Qu.: 8.000   3rd Qu.: 6.000   3rd Qu.:3172   3rd Qu.:24.70  
 Max.   :12.000   Max.   :15.000   Max.   :3999   Max.   :36.16  
     salary      
 Min.   : 0.520  
 1st Qu.: 5.635  
 Median :10.380  
 Mean   :10.364  
 3rd Qu.:15.060  
 Max.   :19.980  
#---------------------------------------

# 6. Data Visualisation
ggplot(data, aes(x = position, y = market_value)) + geom_boxplot()

ggplot(data, aes(x = goals, y = market_value)) + geom_point()

ggplot(data, aes(x = position, y = player_rating)) + geom_bar(stat="summary", fun="mean")

#---------------------------------------

# 7. Exploratory Data Analysis
library(corrplot)
corr_matrix <- cor(data[c("goals", "assists", "minutes_played", "market_value", "player_rating")])
corrplot(corr_matrix, method="color")

#---------------------------------------
# Part 3: Predictive Analysis
#---------------------------------------

# 8. Linear Regression
model <- lm(market_value ~ goals + assists + minutes_played + age + player_rating + position, data = data)
summary(model)

Call:
lm(formula = market_value ~ goals + assists + minutes_played + 
    age + player_rating + position, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.2532 -2.4863  0.0106  2.3361  5.1725 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)         4.779e+00  7.335e-01   6.515 1.17e-10 ***
goals               1.526e+00  3.961e-02  38.525  < 2e-16 ***
assists             1.176e+00  4.053e-02  29.012  < 2e-16 ***
minutes_played     -1.926e-05  9.194e-05  -0.209    0.834    
age                 4.983e-04  1.412e-02   0.035    0.972    
player_rating       4.182e-02  5.415e-02   0.772    0.440    
positionForward    -1.199e-01  2.604e-01  -0.460    0.645    
positionGoalkeeper -9.554e-02  2.681e-01  -0.356    0.722    
positionMidfielder -2.577e-01  2.650e-01  -0.972    0.331    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.884 on 966 degrees of freedom
Multiple R-squared:  0.7002,    Adjusted R-squared:  0.6977 
F-statistic:   282 on 8 and 966 DF,  p-value: < 2.2e-16
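
# 8a. Train/Test Split and RMSE (sketch)
# The model above was fitted on the full dataset. The task asked for an 80/20
# split and RMSE evaluation; a sketch using the caTools and Metrics packages
# (assumed to be installed) might look like this:
library(caTools)
library(Metrics)

set.seed(42)  # for a reproducible split
split <- sample.split(data$market_value, SplitRatio = 0.8)
train <- subset(data, split == TRUE)
test  <- subset(data, split == FALSE)

model_train <- lm(market_value ~ goals + assists + minutes_played +
                    age + player_rating + position, data = train)

rmse(train$market_value, predict(model_train, newdata = train))  # training RMSE
rmse(test$market_value, predict(model_train, newdata = test))    # test RMSE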
#---------------------------------------

# 9. Logistic Regression
log_model <- glm(is_foreign ~ goals + assists + minutes_played + age + player_rating + market_value, data = data, family = "binomial")
summary(log_model)

Call:
glm(formula = is_foreign ~ goals + assists + minutes_played + 
    age + player_rating + market_value, family = "binomial", 
    data = data)

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)  
(Intercept)     6.082e-01  5.628e-01   1.081   0.2798  
goals           5.574e-02  4.849e-02   1.149   0.2504  
assists         5.282e-02  4.262e-02   1.239   0.2152  
minutes_played  2.521e-05  7.060e-05   0.357   0.7210  
age            -2.663e-02  1.087e-02  -2.449   0.0143 *
player_rating  -6.956e-02  4.159e-02  -1.673   0.0944 .
market_value   -4.445e-02  2.476e-02  -1.795   0.0726 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1176.4  on 974  degrees of freedom
Residual deviance: 1163.9  on 968  degrees of freedom
AIC: 1177.9

Number of Fisher Scoring iterations: 4
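
# 9a. Confusion Matrix and Precision/Recall/F1 (sketch)
# The task asked for evaluation via caret::confusionMatrix(). A sketch, assuming
# the caret package is installed, classifying at a 0.5 probability threshold:
library(caret)

predicted_prob  <- predict(log_model, type = "response")
predicted_class <- ifelse(predicted_prob > 0.5, 1, 0)

confusionMatrix(factor(predicted_class, levels = c(0, 1)),
                factor(data$is_foreign, levels = c(0, 1)),
                positive = "1", mode = "prec_recall")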
#---------------------------------------
# Part 4: Stretch Challenge
#---------------------------------------

# 10. High-performing Foreign Players
data$high_performer <- ifelse(data$is_foreign == 1 & data$goals > 10 & data$player_rating > 7, 1, 0)
summary(data$high_performer)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
0.000000 0.000000 0.000000 0.009231 0.000000 1.000000 
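
# summary() reports the mean of the 0/1 flag; a direct count may be clearer:
sum(data$high_performer)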
#---------------------------------------

# 11. Feature Engineering
data$interaction1 <- data$player_rating * data$goals
data$interaction2 <- data$player_rating * data$market_value
model_with_interactions <- lm(market_value ~ goals + assists + minutes_played + age + player_rating + interaction1 + interaction2, data = data)
summary(model_with_interactions)

Call:
lm(formula = market_value ~ goals + assists + minutes_played + 
    age + player_rating + interaction1 + interaction2, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.7267 -0.4187  0.0273  0.4495  5.0455 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)     9.916e+00  4.363e-01  22.728  < 2e-16 ***
goals           1.571e+00  5.425e-02  28.966  < 2e-16 ***
assists         1.187e-01  1.704e-02   6.966 6.04e-12 ***
minutes_played -3.261e-05  2.897e-05  -1.126    0.261    
age            -1.565e-04  4.443e-03  -0.035    0.972    
player_rating  -1.350e+00  5.654e-02 -23.883  < 2e-16 ***
interaction1   -2.012e-01  7.742e-03 -25.989  < 2e-16 ***
interaction2    1.287e-01  1.375e-03  93.584  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9095 on 967 degrees of freedom
Multiple R-squared:  0.9702,    Adjusted R-squared:  0.9699 
F-statistic:  4491 on 7 and 967 DF,  p-value: < 2.2e-16
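
A note of caution on this solution: [interaction2] multiplies [player_rating] by [market_value], which is also the outcome being predicted. Having the response on both sides of the model largely explains the jump in R-squared from around 0.70 to 0.97, so in practice you should avoid interaction terms that involve the outcome variable.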